The violent crime rate in the U.S. increased by 3.4 percent nationwide in 2016. As international students and New Yorkers, we are always concerned about public safety in NYC, especially after the recent terrorist attack near the World Trade Center. Our group therefore decided to investigate the crime data more closely and look for underlying factors that may have contributed to the increase in the crime rate.
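# Make sure the columns we recode or compare are character vectors
crime_type = nyc_crime_2017 %>%
  mutate(prem_typ = as.character(prem_typ),
         boro = as.character(boro),
         ofns_type = as.character(ofns_type),
         ofns_desc = as.character(ofns_desc))
# Collapse every premise type that mentions RESIDENCE into one category
crime_type$prem_typ[grep("RESIDENCE", crime_type$prem_typ)] = "RESIDENCE"
# Label empty premise types as OTHER
crime_type$prem_typ[crime_type$prem_typ == ""] <- "OTHER"
# Collapse every premise type that mentions COMMERCIAL into one category
crime_type$prem_typ[grep("COMMERCIAL", crime_type$prem_typ)] = "COMMERCIAL BLDG"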
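# For one borough, return a 4 x 4 character matrix: the four premise types with
# the most complaints, each followed by its three most frequent offenses
place_offense = function(x){
  crime_boro = crime_type %>%
    filter(boro == x) %>%
    group_by(prem_typ) %>%
    summarize(crime = n()) %>%
    arrange(desc(crime)) %>%
    head(4)  # exactly four rows; top_n() would keep extra rows when counts tie
  top_place = crime_boro$prem_typ
  mat <- matrix(ncol = 3, nrow = 4)
  for (i in 1:4) {
    offense_select = crime_type %>%
      filter(prem_typ == top_place[i],
             boro == x) %>%
      group_by(ofns_desc) %>%
      summarize(crime = n()) %>%
      arrange(desc(crime)) %>%
      head(3)
    # Indexing 1:3 pads with NA if a place has fewer than three offenses
    mat[i,] = offense_select$ofns_desc[1:3]
  }
  mat <- cbind(top_place[1:4], mat)
  return(mat)
}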
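# Borough names in the order they first appear in the data
# (named boro_names so the vector does not clash with the boro column)
boro_names = crime_type %>% distinct(boro) %>% pull()
place_offense_result = map(boro_names, place_offense)
boro_plot = vector("list", length = 5)
for (i in 1:5) {
  # Single-bracket indexing keeps the list wrapper, so the columns are named X1..X4
  df_boro = as.data.frame(place_offense_result[i])
  df_boro = df_boro %>%
    rename(place = X1, top_1 = X2, top_2 = X3, top_3 = X4) %>%
    gather(key = ofns_rank, value = ofns_desc, top_1:top_3)
  boro_ofns = unique(df_boro$ofns_desc)
  boro_plot[[i]] = crime_type %>%
    # Keep only this borough's complaints so each figure matches its title
    filter(boro == boro_names[i],
           prem_typ %in% c("STREET", "RESIDENCE"),
           ofns_desc %in% boro_ofns) %>%
    ggplot(aes(x = prem_typ, fill = ofns_desc)) + geom_bar() + coord_flip() + theme_bw() +
    theme(axis.title = element_blank(), legend.position = "bottom")
}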
boro_plot[[1]]= boro_plot[[1]] + ggtitle("Figure b - Manhattan" )
boro_plot[[2]]= boro_plot[[2]] + ggtitle("Figure c - Brooklyn")
boro_plot[[3]]= boro_plot[[3]] + ggtitle("Figure d - Staten Island")
boro_plot[[4]]= boro_plot[[4]] + ggtitle("Figure e - Queens")
boro_plot[[5]]= boro_plot[[5]] + ggtitle("Figure f - Bronx")
library(gridExtra)
library(grid)
place_boro = crime_type %>%
group_by(prem_typ) %>%
summarize(crime = n()) %>%
arrange(desc(crime)) %>%
head(5)  # rows are already sorted by count, so this keeps exactly the top five
top_place = place_boro$prem_typ
place_boro_plot = crime_type %>%
filter(prem_typ %in% top_place) %>%
ggplot(aes(x=prem_typ,fill = boro)) + geom_bar() + theme_bw() + theme(axis.text.x = element_text(angle = 15, hjust=1)) +
labs(title = "Figure a - Crime Counts Against Places",
x = "Place",
y = "Counts")
title = textGrob("Common Crimes in Each Borough", gp = gpar(fontsize = 25))
grid.arrange(place_boro_plot,boro_plot[[1]],
boro_plot[[2]],boro_plot[[3]],
boro_plot[[4]],boro_plot[[5]],
ncol=2, top = title)
**Comments**
* In Figure a, we present the five places where crimes most often occur across the five boroughs of NYC. STREET and RESIDENCE are the least safe places, so we look further at the major crime types in these two places for each borough.
* The length of each bar gives the crime count, and its colored segments give the breakdown by offense type. Assault, harassment, and criminal mischief are clearly much more prevalent than other crimes in most boroughs.
* In residences, harassment and assault are more prevalent, while petit larceny and criminal mischief account for a larger share of crimes on the street.
* Next, we compare the major crime types across boroughs. In Figures b-f, the distribution of crimes is similar in Manhattan, Brooklyn, Staten Island, and Queens. The pattern in the Bronx, however, appears more complex. Specifically, crimes in the Bronx tend to be more serious than in the other four boroughs, with FELONY ASSAULT and DANGEROUS DRUGS, both in the felony category, among the common offenses.
* Overall, the safety level in Manhattan, Staten Island, and Queens is relatively higher.
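
The comparison of offense shares within each place can be checked numerically as well as read off the stacked bars. The sketch below uses base R only and a small hypothetical data frame (`toy` and its rows are illustrative, not taken from the NYPD data); it cross-tabulates place by offense and converts each row to proportions.

```r
# Hypothetical toy complaints; the values are illustrative only
toy <- data.frame(
  prem_typ  = c("STREET", "STREET", "STREET", "STREET",
                "RESIDENCE", "RESIDENCE", "RESIDENCE", "RESIDENCE"),
  ofns_desc = c("PETIT LARCENY", "PETIT LARCENY", "CRIMINAL MISCHIEF", "ASSAULT",
                "HARASSMENT", "HARASSMENT", "ASSAULT", "CRIMINAL MISCHIEF"),
  stringsAsFactors = FALSE
)

# Cross-tabulate place by offense: these are the counts a stacked bar chart draws
counts <- table(toy$prem_typ, toy$ofns_desc)

# Convert each row (place) to shares that sum to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Comparing shares rather than raw counts avoids mistaking a busy place for one where a particular offense is unusually common.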
nyc_crime = read_csv("./NYPD_Complaint_Data_Current_YTD.csv") %>%
clean_names() %>%
select(boro = boro_nm)
crime_number = nyc_crime %>%
group_by(boro) %>%
summarise(n = n())
population = read_csv("./NYC_Population_by_Borough.csv") %>%
mutate(boro = Borough) %>%
select(-Borough)
nyc_crime_population = left_join(population, crime_number, by = "boro") %>%
  clean_names() %>%
  mutate(population = as.numeric(population)) %>%
  # Complaints per 100,000 residents
  mutate(crime_rate = n / population * 100000)
income = read_csv("./NYC_Income_by_Borough.csv") %>%
clean_names() %>%
mutate(boro = borough) %>%
select(-borough)
crime_income = left_join(income, nyc_crime_population, by = "boro")
crime_income %>%
  ggplot(aes(x = income, y = crime_rate, color = income)) + geom_point(alpha = 0.5) + geom_smooth() +
  labs(title = "Correlation between median family income and crime rate in each borough",
       x = "Income Range",
       y = "Crime rate (per 100,000 residents)")
In addition, we were interested in finding factors that may be associated with the crime rate, and we chose household income level. After reading the data from the web, cleaning it, and visualizing it, we were surprised by the scatter plot: both the lowest-income and the highest-income boroughs have very high crime rates. For example, the Bronx has a median family income of 35,176 dollars and a crime rate of 0.029 complaints per resident, that is, about 29 cases per 1,000 people (2,900 per 100,000). In contrast, boroughs with a median family income between 60,000 and 70,000 dollars tend to have the lowest crime rates. In Queens, for example, we expect only about 15 cases per 1,000 people.
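Because the same rate can be quoted per resident, per 1,000, or per 100,000, a small conversion helper makes the units explicit. This is a minimal sketch in base R: `crime_rate_per` is a hypothetical helper name, and the counts are made up so that the raw rate equals the 0.029 quoted for the Bronx; they are not the borough's actual figures.

```r
# Complaints per `per` residents; crime_rate_per is a hypothetical helper name
crime_rate_per <- function(n_crimes, population, per = 1000) {
  n_crimes / population * per
}

# Hypothetical counts chosen so that the raw rate is 0.029
n_crimes   <- 29000
population <- 1000000

raw_rate <- n_crimes / population                            # per resident
per_1k   <- crime_rate_per(n_crimes, population)             # per 1,000 residents
per_100k <- crime_rate_per(n_crimes, population, per = 1e5)  # per 100,000 residents

print(c(raw = raw_rate, per_1k = per_1k, per_100k = per_100k))
```

Stating the denominator in the axis label or the text keeps a value such as 0.029 from being misread as a rate per 1,000 or per 100,000.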
library(tidytext)
crime_words = nyc_crime_2017 %>%
select(-longitude, -latitude) %>%
mutate(ofns_desc = str_to_lower(ofns_desc),
ofns_desc = str_replace(ofns_desc, "[2-3]",""),
ofns_desc = as.character(ofns_desc)) %>%
unnest_tokens(word, ofns_desc)
data(stop_words)
# Drop common stop words, joining explicitly on the word column
crime_word_tidy =
  anti_join(crime_words, stop_words, by = "word")
# Plot the ten most frequent words
crime_word_tidy %>%
  count(word, sort = TRUE) %>%
  top_n(10, n) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity", fill = "blue", alpha = .6) +
  coord_flip()
The graph shows the 10 most frequent words in the offense descriptions. The most frequent is larceny, which appears nearly 100,000 times. Other frequent words include related, petit, assault, and harassment. Most of them name a type of crime, which is what we expected.
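The word-frequency step can be illustrated without any packages. The sketch below is an assumption-laden simplification in base R, not the tidytext pipeline itself: the four descriptions are hypothetical examples, and `toy_stop_words` is a tiny stand-in for a real stop-word lexicon.

```r
# Hypothetical offense descriptions; illustrative only
descs <- c("PETIT LARCENY", "GRAND LARCENY",
           "ASSAULT 3 & RELATED OFFENSES", "OFFENSES AGAINST THE PERSON")

# Lowercase, replace anything that is not a letter or space, then split on whitespace
clean <- gsub("[^a-z ]", " ", tolower(descs))
words <- unlist(strsplit(clean, "\\s+"))
words <- words[words != ""]

# Tiny stand-in stop-word list (an assumption, not the tidytext lexicon)
toy_stop_words <- c("the", "of", "and")
words <- words[!words %in% toy_stop_words]

# Count each word and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```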
word_ratios = crime_word_tidy %>%
filter(ofns_type %in% c("VIOLATION" , "FELONY")) %>%
count(word, ofns_type) %>%
group_by(word) %>%
filter(sum(n) >= 5) %>%
ungroup() %>%
spread(ofns_type, n, fill = 0) %>%
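mutate(
# Add-one smoothing keeps the ratio finite when a word is absent from one group
violation_odds = (VIOLATION + 1) / (sum(VIOLATION) + 1),
felony_odds = (FELONY + 1) / (sum(FELONY) + 1),
# Positive values mean the word is relatively more common in felony descriptions
log_OR = log(felony_odds / violation_odds)
) %>%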
arrange(desc(log_OR))
word_ratios %>%
mutate(pos_log_OR = ifelse(log_OR > 0, "felony_odds > violation_odds", "violation_odds > felony_odds")) %>%
group_by(pos_log_OR) %>%
top_n(10, abs(log_OR)) %>%
ungroup() %>%
mutate(word = fct_reorder(word, log_OR)) %>%
ggplot(aes(word, log_OR, fill = pos_log_OR)) +
geom_col() +
coord_flip() +
ylab("log odds ratio (felony_odds/violation_odds)") +
scale_fill_discrete(name = "") +
theme(legend.position = "bottom")
The chart above compares distinctive words (words that appear much more often in one group than in the other) in the offense descriptions of violations and felonies. Words such as larceny, robbery, and burglary appear more often in felony descriptions, while harassment, gambling, and loitering appear more often in violation descriptions. These results give a basic picture of how felonies and violations differ.
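To make the smoothed log odds ratio concrete, the sketch below repeats the same formula on a small hypothetical table in base R. The word counts in `toy` are invented for illustration and are not results from the NYPD data.

```r
# Hypothetical word counts per offense type; illustrative only
toy <- data.frame(
  word      = c("larceny", "robbery", "harassment", "loitering"),
  FELONY    = c(40, 20, 0, 0),
  VIOLATION = c(0, 0, 50, 10),
  stringsAsFactors = FALSE
)

# Same add-one smoothing as in the analysis: (count + 1) / (group total + 1)
felony_odds    <- (toy$FELONY + 1) / (sum(toy$FELONY) + 1)
violation_odds <- (toy$VIOLATION + 1) / (sum(toy$VIOLATION) + 1)

# Positive values lean toward felony, negative values toward violation
toy$log_OR <- log(felony_odds / violation_odds)
print(toy[order(-toy$log_OR), ])
```

In this toy table both groups have the same total, so the ratio for larceny reduces to (40 + 1) / (0 + 1) = 41 and its log odds ratio is log(41), about 3.71. A word that never appears in felony descriptions, such as harassment here, gets a negative value instead of an undefined one because of the smoothing.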